#1. Import data
library(UsingR)
## Loading required package: MASS
## Loading required package: HistData
## Loading required package: Hmisc
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(fpp)
## Loading required package: fma
## 
## Attaching package: 'fma'
## The following object is masked from 'package:UsingR':
## 
##     chicken
## The following objects are masked from 'package:MASS':
## 
##     cement, housing, petrol
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: tseries
library(lmtest)
library(MASS)
mydata <- read.csv("/Users/angela1230/Desktop/TOTALSA.csv")
#2. Plot and Inference
sales_ts <- ts(mydata$Sales.Units.in.Millions., start = c(2019,01), frequency = 12)
plot(sales_ts, main = "Time series for car sales", ylab = "sales")

Summary of time series: The plot shows a sharp drop in car sales around the first quarter of 2020, most likely due to the external shock of the COVID-19 pandemic. Sales recovered quickly after the drop, and the trend rises gradually from 2022 to 2024, showing a steady recovery and growth in car sales.
#3. Central Tendency: for the time series, the minimum is 8.944, the maximum is 18.697, the mean is 15.612, the median is 15.938, the 1st quartile is 14.189, and the 3rd quartile is 16.966.
summary(sales_ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.944  14.189  15.938  15.612  16.966  18.697
boxplot(sales_ts, main="Box Plot of Car Sales", ylab="Sales")

Summary observation of the box plot: the summary statistics show a slightly skewed distribution (the mean falls below the median). The minimum value is the 2020 drop and the maximum is the peak of the rapid sales recovery. The box plot shows one clear outlier, which corresponds to the drop in the time series plot and the minimum value in the summary statistics.
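The outlier call can be checked against the 1.5 * IQR rule that boxplot() uses; a quick check with the quartiles from the summary statistics above:

```r
# boxplot() flags a point as an outlier when it falls more than
# 1.5 * IQR below the first quartile (the lower fence).
q1 <- 14.189
q3 <- 16.966
lower_fence <- q1 - 1.5 * (q3 - q1)   # 14.189 - 1.5 * 2.777 = 10.0235
8.944 < lower_fence                   # TRUE: the 2020 minimum is the flagged outlier
```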
#4. Decomposition
decomp <- decompose(sales_ts)
plot(decomp)

decomp$seasonal
##              Jan         Feb         Mar         Apr         May         Jun
## 2019  0.72247517  0.43065226 -0.17313941 -0.51546233 -0.37573316 -0.28201441
## 2020  0.72247517  0.43065226 -0.17313941 -0.51546233 -0.37573316 -0.28201441
## 2021  0.72247517  0.43065226 -0.17313941 -0.51546233 -0.37573316 -0.28201441
## 2022  0.72247517  0.43065226 -0.17313941 -0.51546233 -0.37573316 -0.28201441
## 2023  0.72247517  0.43065226 -0.17313941 -0.51546233 -0.37573316 -0.28201441
## 2024  0.72247517  0.43065226                                                
##              Jul         Aug         Sep         Oct         Nov         Dec
## 2019  0.05948767 -0.06814566 -0.09667066  0.08566267  0.15813142  0.05475642
## 2020  0.05948767 -0.06814566 -0.09667066  0.08566267  0.15813142  0.05475642
## 2021  0.05948767 -0.06814566 -0.09667066  0.08566267  0.15813142  0.05475642
## 2022  0.05948767 -0.06814566 -0.09667066  0.08566267  0.15813142  0.05475642
## 2023  0.05948767 -0.06814566 -0.09667066  0.08566267  0.15813142  0.05475642
## 2024
seasadj(decomp)
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2019 16.247525 16.531348 18.015139 17.483462 18.342733 18.063014 17.622512
## 2020 16.170525 16.657348 12.009139  9.459462 12.883733 13.752014 15.165512
## 2021 15.974525 15.693348 18.669139 19.212462 17.841733 16.185014 15.147512
## 2022 14.143525 13.737348 14.426139 15.196462 13.661733 13.951014 13.807512
## 2023 14.942525 15.003348 15.841139 16.961462 16.323733 16.710014 16.242512
## 2024 14.790525 15.760348                                                  
##            Aug       Sep       Oct       Nov       Dec
## 2019 17.950146 17.816671 16.963337 17.390869 17.436244
## 2020 16.000146 16.835671 16.527337 16.193869 16.917244
## 2021 13.833146 12.818671 13.095337 13.325869 13.111244
## 2022 14.132146 14.232671 15.005337 14.574869 13.880244
## 2023 15.977146 16.289671 15.713337 15.786869 16.331244
## 2024
plot(sales_ts)
lines(seasadj(decomp), col = "blue")

Summary: The decomposition plot shows clear seasonality, and the decomposition is additive, as the plot title indicates. The monthly seasonal indices show that car sales are highest in January (about +0.72) and lowest in April (about -0.51). One plausible explanation is January discounting, as dealerships clear out the previous model year after the holiday season; afterward consumers need fewer cars and discounts fade, so sales dip through the spring. The seasonally adjusted series is smoother and tracks the original series without large fluctuations.
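For an additive decomposition, the seasonally adjusted series is simply the original series minus the seasonal component, which is what seasadj() returns. A sketch with the built-in AirPassengers data, since the CSV is local:

```r
# decompose() is additive by default; its seasonal figure is centered,
# so the 12 monthly indices sum to (approximately) zero.
d <- decompose(AirPassengers)
adj <- AirPassengers - d$seasonal     # same series seasadj(d) would return
round(sum(d$seasonal[1:12]), 6)       # ~0: the indices average out over a year
```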
#5. Naive Method
naive_forecast <- naive(sales_ts, h = 12)
plot(naive_forecast)

#plot residual: the residuals fluctuate randomly around zero, suggesting no clear trend but some variability, such as the spike in 2020.
plot(naive_forecast$residuals, main="Residuals of Naïve Forecast", ylab="Residuals", xlab="Time")

#histogram plot for residual: the residuals are mostly centered around zero but appear slightly skewed, with some outliers.
hist(naive_forecast$residuals, main="Histogram of Residuals", xlab="Residuals")

#fitted values vs. residuals: there is no strong pattern, as expected for the naive method.
plot(fitted(naive_forecast), naive_forecast$residuals, main="Fitted Values vs Residuals", xlab="Fitted Values", ylab="Residuals")
abline(h=0, col="red")

#actual values vs. residuals: the variability is similar to the fitted-values plot, with larger deviations around the extreme values in 2020.
plot(sales_ts, naive_forecast$residuals, main="Actual Values vs Residuals", xlab="Actual Values", ylab="Residuals")
abline(h=0, col="red")

#acf plot for residual: the plot shows weak autocorrelation at some lags, meaning the naive method does not fully account for the dependencies in the data.
Acf(naive_forecast$residuals, main="ACF of Residuals")

#five measures of accuracy: the MAPE shows the forecast errors average about 5.48% of the actual values. The small ACF1 of 0.0707 indicates little first-lag autocorrelation remains, though the naive method still ignores trend and seasonality in the data.
accuracy(naive_forecast)
##                       ME     RMSE       MAE        MPE    MAPE      MASE
## Training set -0.01277049 1.184281 0.7771311 -0.4826249 5.48124 0.3223567
##                   ACF1
## Training set 0.0706958
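The ACF plot is a visual check; the Ljung-Box test (base R's Box.test) tests the residual autocorrelations jointly. Sketched on simulated random-walk data, since the CSV is local:

```r
# Ljung-Box test on naive-method residuals. For a series, the naive
# residuals are just the first differences.
set.seed(1)
y <- ts(15 + cumsum(rnorm(60)), start = c(2019, 1), frequency = 12)
res <- diff(y)                         # simulated stand-in for naive residuals
lb <- Box.test(res, lag = 12, type = "Ljung-Box")
lb$p.value                             # a small p-value would mean leftover autocorrelation
```

For the actual series, the same check is `Box.test(na.omit(naive_forecast$residuals), lag = 12, type = "Ljung-Box")` (the leading NA must be dropped).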
#forecast: the naive forecast for the next year stays flat at the last observed value.
plot(naive_forecast, main="Naïve Forecast for Next Year", xlab="Time", ylab="Sales")

print(naive_forecast)
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## Mar 2024         16.191 14.67328 17.70872 13.869852 18.51215
## Apr 2024         16.191 14.04462 18.33738 12.908401 19.47360
## May 2024         16.191 13.56224 18.81976 12.170654 20.21135
## Jun 2024         16.191 13.15557 19.22643 11.548704 20.83330
## Jul 2024         16.191 12.79728 19.58472 11.000755 21.38124
## Aug 2024         16.191 12.47337 19.90863 10.505372 21.87663
## Sep 2024         16.191 12.17550 20.20650 10.049820 22.33218
## Oct 2024         16.191 11.89825 20.48375  9.625802 22.75620
## Nov 2024         16.191 11.63785 20.74415  9.227556 23.15444
## Dec 2024         16.191 11.39156 20.99044  8.850885 23.53111
## Jan 2025         16.191 11.15730 21.22470  8.492623 23.88938
## Feb 2025         16.191 10.93347 21.44853  8.150307 24.23169
#6. Simple Moving Averages: as the order increases, the line becomes smoother, reducing short-term fluctuations and noise in the data. Higher-order moving averages capture the long-term trend more effectively, while lower-order ones track recent changes in the series more closely.
plot(sales_ts, main="Simple Moving Averages", ylab="Sales")
lines(ma(sales_ts, order=3), col="red")    # 3-month MA
lines(ma(sales_ts, order=6), col="blue")   # 6-month MA
lines(ma(sales_ts, order=9), col="green")  # 9-month MA
legend("bottomleft", legend=c("MA(3)", "MA(6)", "MA(9)"),
       col=c("red", "blue", "green"), lty=1)

forecast_ma <- forecast(ma(sales_ts, order=6), h=12)
## Warning in ets(object, lambda = lambda, biasadj = biasadj,
## allow.multiplicative.trend = allow.multiplicative.trend, : Missing values
## encountered. Using longest contiguous portion of time series
plot(forecast_ma)
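forecast::ma() computes a centered moving average; for an odd order, base R's filter() gives the same smoother and makes the windowing explicit:

```r
# Centered 3-term moving average with filter(); the first and last
# points have no full window, so they come back as NA.
x <- ts(c(4, 5, 9, 7, 10, 12), frequency = 12)
ma3 <- stats::filter(x, rep(1/3, 3), sides = 2)
ma3[2]   # (4 + 5 + 9) / 3 = 6
```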

#7. simple smoothing: the value of alpha is 0.9999, which means the model gives almost all the weight to the most recent observation when forecasting. The initial level is 16.9716, and sigma is 1.1941, which is the standard deviation of the residuals.
ets_model <- ets(sales_ts)
summary(ets_model)
## ETS(A,N,N) 
## 
## Call:
## ets(y = sales_ts)
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 16.9716 
## 
##   sigma:  1.1941
## 
##      AIC     AICc      BIC 
## 281.8479 282.2617 288.2293 
## 
## Training set error measures:
##                       ME   RMSE       MAE       MPE     MAPE      MASE
## Training set -0.01259346 1.1747 0.7646209 -0.475062 5.393013 0.3171674
##                    ACF1
## Training set 0.07078222
ets_forecast <- forecast(ets_model, h = 12)
plot(ets_forecast)
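Since ets() selected ETS(A,N,N), the fit is plain simple exponential smoothing; base R can fit the same model via HoltWinters with the trend and seasonal terms switched off. A sketch on simulated data, since the CSV is local:

```r
# Simple exponential smoothing = HoltWinters with no trend (beta = FALSE)
# and no seasonality (gamma = FALSE); alpha is estimated by optimization.
set.seed(2)
y <- ts(rnorm(48, mean = 15), frequency = 12)
fit <- HoltWinters(y, beta = FALSE, gamma = FALSE)
fit$alpha   # the estimated smoothing parameter, between 0 and 1
```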

#plot residual: the residuals fluctuate randomly around zero with some large deviations, like the one in 2020.
plot(ets_forecast$residuals, main="Residuals of Simple Smoothing Forecast", ylab="Residuals", xlab="Time")

#histogram plot for residual: the residuals are approximately centered around zero but are not entirely normal because of the outliers and skewness.
hist(ets_forecast$residuals, main="Histogram of Residuals", xlab="Residuals")

#fitted values vs. residuals: there is no clear pattern, suggesting no systematic bias in this model.
plot(fitted(ets_forecast), ets_forecast$residuals, main="Fitted Values vs Residuals", xlab="Fitted Values", ylab="Residuals")
abline(h=0, col="red")

#actual values vs. residuals: the variability is uneven, with larger residuals at certain actual values, suggesting the model performs poorly during periods of significant change.
plot(sales_ts, ets_forecast$residuals, main="Actual Values vs Residuals", xlab="Actual Values", ylab="Residuals")
abline(h=0, col="red")

#acf plot for residual: there is weak autocorrelation at some lags, meaning the model has not fully captured the dependencies in the data.
Acf(ets_forecast$residuals, main="ACF of Residuals")

#forecast
plot(ets_forecast)

print(ets_forecast)
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## Mar 2024       16.19093 14.66061 17.72126 13.850504 18.53136
## Apr 2024       16.19093 14.02684 18.35503 12.881233 19.50063
## May 2024       16.19093 13.54051 18.84135 12.137463 20.24440
## Jun 2024       16.19093 13.13052 19.25135 11.510429 20.87144
## Jul 2024       16.19093 12.76930 19.61256 10.957996 21.42387
## Aug 2024       16.19093 12.44273 19.93913 10.458558 21.92331
## Sep 2024       16.19093 12.14242 20.23944  9.999275 22.38259
## Oct 2024       16.19093 11.86290 20.51896  9.571784 22.81008
## Nov 2024       16.19093 11.60037 20.78149  9.170275 23.21159
## Dec 2024       16.19093 11.35206 21.02980  8.790518 23.59135
## Jan 2025       16.19093 11.11589 21.26598  8.429319 23.95254
## Feb 2025       16.19093 10.89022 21.49164  8.084198 24.29767
#5 measures of accuracy: the model's MAPE is 5.39%, meaning the forecast errors are relatively small compared to the actual values, but ACF1 = 0.0708 suggests some temporal dependence remains. The model works well for stable series.
accuracy(ets_forecast)
##                       ME   RMSE       MAE       MPE     MAPE      MASE
## Training set -0.01259346 1.1747 0.7646209 -0.475062 5.393013 0.3171674
##                    ACF1
## Training set 0.07078222
#8. Holt-Winters: note that the summary() output below lists the length and class of each component stored in the fit (the 1s in the Length column are lengths, not parameter values). The smoothing parameters themselves are alpha (how strongly the level tracks recent observations), beta (how quickly the trend estimate updates), and gamma (how quickly the seasonal component adapts to the latest seasonal pattern), each between 0 and 1 and stored in hw_forecast$alpha, $beta, and $gamma.
hw_forecast <- HoltWinters(sales_ts)
summary(hw_forecast)
##              Length Class  Mode     
## fitted       200    mts    numeric  
## x             62    ts     numeric  
## alpha          1    -none- numeric  
## beta           1    -none- numeric  
## gamma          1    -none- numeric  
## coefficients  14    -none- numeric  
## seasonal       1    -none- character
## SSE            1    -none- numeric  
## call           2    -none- call
hw_pred <- forecast(hw_forecast, h = 12)
plot(hw_pred)
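The fitted smoothing parameters can be pulled from the HoltWinters object directly; a sketch on simulated seasonal data, since the CSV is local:

```r
# Monthly series with a sinusoidal seasonal pattern plus noise;
# HoltWinters() estimates alpha, beta, and gamma by optimization.
set.seed(3)
y <- ts(15 + rep(sin(2 * pi * (1:12) / 12), 4) + rnorm(48, sd = 0.2),
        frequency = 12)
fit <- HoltWinters(y)
round(c(alpha = unname(fit$alpha),
        beta  = unname(fit$beta),
        gamma = unname(fit$gamma)), 3)
```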

#plot residual: the residuals vary over time with some spikes, suggesting certain patterns or outliers may not be captured entirely by the model.
plot(hw_pred$residuals, main="Residuals of Holt-Winters Forecast", ylab="Residuals", xlab="Time")

#histogram plot for residual: the residuals are roughly centered around zero, with some skewness and outliers.
hist(hw_pred$residuals, main="Histogram of Residuals", xlab="Residuals")

#fitted values vs. residuals: the model shows no significant bias in the forecast, though some variability in the residuals remains.
plot(fitted(hw_pred), hw_pred$residuals, main="Fitted Values vs Residuals", xlab="Fitted Values", ylab="Residuals")
abline(h=0, col="red")

#actual values vs. residuals: there is no clear relationship between the actual values and the residuals, meaning the model captures the main trend of the data.
plot(sales_ts, hw_pred$residuals, main="Actual Values vs Residuals", xlab="Actual Values", ylab="Residuals")
abline(h=0, col="red")

#acf plot for residual: some lags are significant, meaning autocorrelation may remain in the residuals.
Acf(hw_pred$residuals, main="ACF of Residuals")

#forecast
plot(hw_pred)

print(hw_pred)
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Mar 2024       14.53712 11.863655 17.21059 10.448406 18.62584
## Apr 2024       12.37563  9.152812 15.59845  7.446756 17.30450
## May 2024       10.51822  6.826930 14.20952  4.872876 16.16357
## Jun 2024       10.66407  6.557397 14.77074  4.383455 16.94469
## Jul 2024       11.27031  6.786577 15.75405  4.413030 18.12759
## Aug 2024       11.75506  6.923600 16.58652  4.365981 19.14414
## Sep 2024       12.30912  7.153337 17.46490  4.424030 20.19421
## Oct 2024       12.59542  7.134538 18.05630  4.243722 20.94711
## Nov 2024       12.77769  7.027876 18.52750  3.984110 21.57127
## Dec 2024       12.88147  6.856570 18.90637  3.667179 22.09576
## Jan 2025       13.01780  6.729832 19.30577  3.401182 22.63442
## Feb 2025       12.95541  6.414947 19.49587  2.952635 22.95819
#5 measures of accuracy: the MAPE of 9.14% means the forecast errors average about 9.14% of the actual values, the highest among the models so far.
accuracy(hw_pred)
##                     ME     RMSE      MAE     MPE     MAPE      MASE     ACF1
## Training set 0.2810875 2.084194 1.442687 1.49868 9.140589 0.5984317 0.381969
#9. ARIMA or Box-Jenkins: the ADF test p-value is 0.09624, which is greater than 0.05, so we cannot reject non-stationarity at the 5% level; one level of differencing is therefore worth examining. Based on the ACF/PACF of the differenced series, candidate models include ARIMA(1,1,0), ARIMA(0,1,1), and ARIMA(1,1,1). auto.arima, however, selects ARIMA(1,0,0)(0,0,1)[12] with a non-zero mean (AIC = 194.03, BIC = 202.54, sigma^2 = 1.188), keeping the series undifferenced and adding a seasonal MA(1) term at lag 12. Written out, the fitted model is Xt - 15.5860 = 0.7671 (Xt-1 - 15.5860) + et - 0.3536 et-12.
adf.test(sales_ts) 
## 
##  Augmented Dickey-Fuller Test
## 
## data:  sales_ts
## Dickey-Fuller = -3.1972, Lag order = 3, p-value = 0.09624
## alternative hypothesis: stationary
diff_data <- diff(sales_ts)

Acf(diff_data)

Pacf(diff_data)

fit_arima <- auto.arima(sales_ts)
summary(fit_arima)
## Series: sales_ts 
## ARIMA(1,0,0)(0,0,1)[12] with non-zero mean 
## 
## Coefficients:
##          ar1     sma1     mean
##       0.7671  -0.3536  15.5860
## s.e.  0.0815   0.1912   0.4011
## 
## sigma^2 = 1.188:  log likelihood = -93.01
## AIC=194.03   AICc=194.73   BIC=202.54
## 
## Training set error measures:
##                       ME     RMSE       MAE        MPE     MAPE      MASE
## Training set 0.007369114 1.063385 0.7077504 -0.5744392 5.043681 0.2935773
##                  ACF1
## Training set 0.109244
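The ARIMA(1,0,0)(0,0,1)[12] specification that auto.arima chose can also be written out explicitly with base R's arima(); a sketch on simulated AR(1) data, since the CSV is local:

```r
# order = c(p, d, q) gives the non-seasonal part; the seasonal part is
# passed as its own (P, D, Q) triple with the period. Note that arima()'s
# "intercept" coefficient is actually the series mean.
set.seed(4)
y <- ts(15 + arima.sim(list(ar = 0.7), n = 120), frequency = 12)
fit <- arima(y, order = c(1, 0, 0),
             seasonal = list(order = c(0, 0, 1), period = 12))
round(coef(fit), 3)   # ar1, sma1, and the intercept (mean)
```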
arima_forecast <- forecast(fit_arima, h=12)
plot(arima_forecast)

#plot residual: the residuals are scattered randomly around zero, meaning the ARIMA model captures the main patterns.
plot(fit_arima$residuals,  main="Residuals of ARIMA", xlab="Time", ylab="Residuals")

#histogram plot for residual: the residuals are approximately centered around zero, with some skewness and outliers on one side.
hist(fit_arima$residuals,  main="Histogram of Residuals", xlab="Residuals")

#fitted values vs. residuals: the residuals are scattered around zero with no clear pattern, though they cluster in places.
plot(fitted(fit_arima), fit_arima$residuals, main="Fitted Values vs Residuals", xlab="Fitted Values", ylab="Residuals")
abline(h=0, col="red")

#actual values vs. residuals: there is some clustering in certain ranges of actual values.
plot(sales_ts, fit_arima$residuals, main="Actual Values vs Residuals", xlab="Actual Values", ylab="Residuals")
abline(h=0, col="red")

#acf plot for residual: most lags fall within the confidence bounds, so the residuals are mostly uncorrelated, meaning the ARIMA model captures the correlation structure in the data well.
Acf(fit_arima$residuals, main="ACF of Residuals")

#forecast
plot(arima_forecast)

print(arima_forecast)
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Mar 2024       15.96989 14.57289 17.36690 13.83336 18.10642
## Apr 2024       15.60472 13.84403 17.36540 12.91198 18.29745
## May 2024       15.89106 13.94792 17.83419 12.91929 18.86283
## Jun 2024       15.68345 13.64056 17.72635 12.55912 18.80779
## Jul 2024       15.67279 13.57341 17.77217 12.46207 18.88352
## Aug 2024       15.81221 13.68029 17.94413 12.55172 19.07270
## Sep 2024       15.71662 13.56578 17.86745 12.42719 19.00604
## Oct 2024       15.70558 13.54369 17.86747 12.39925 19.01191
## Nov 2024       15.67453 13.50616 17.84290 12.35829 18.99076
## Dec 2024       15.61335 13.44117 17.78552 12.29129 18.93540
## Jan 2025       15.63011 13.45571 17.80452 12.30465 18.95558
## Feb 2025       15.45260 13.27688 17.62832 12.12513 18.78008
plot(forecast(fit_arima, h=24))

print(forecast(fit_arima, h=24))
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Mar 2024       15.96989 14.57289 17.36690 13.83336 18.10642
## Apr 2024       15.60472 13.84403 17.36540 12.91198 18.29745
## May 2024       15.89106 13.94792 17.83419 12.91929 18.86283
## Jun 2024       15.68345 13.64056 17.72635 12.55912 18.80779
## Jul 2024       15.67279 13.57341 17.77217 12.46207 18.88352
## Aug 2024       15.81221 13.68029 17.94413 12.55172 19.07270
## Sep 2024       15.71662 13.56578 17.86745 12.42719 19.00604
## Oct 2024       15.70558 13.54369 17.86747 12.39925 19.01191
## Nov 2024       15.67453 13.50616 17.84290 12.35829 18.99076
## Dec 2024       15.61335 13.44117 17.78552 12.29129 18.93540
## Jan 2025       15.63011 13.45571 17.80452 12.30465 18.95558
## Feb 2025       15.45260 13.27688 17.62832 12.12513 18.78008
## Mar 2025       15.48368 13.26472 17.70264 12.09008 18.87729
## Apr 2025       15.50752 13.26351 17.75154 12.07560 18.93945
## May 2025       15.52581 13.26719 17.78444 12.07154 18.98008
## Jun 2025       15.53984 13.27266 17.80702 12.07249 19.00719
## Jul 2025       15.55060 13.27840 17.82280 12.07557 19.02563
## Aug 2025       15.55886 13.28371 17.83401 12.07932 19.03840
## Sep 2025       15.56519 13.28831 17.84207 12.08300 19.04738
## Oct 2025       15.57005 13.29215 17.84795 12.08630 19.05379
## Nov 2025       15.57377 13.29527 17.85227 12.08911 19.05844
## Dec 2025       15.57663 13.29778 17.85548 12.09143 19.06184
## Jan 2026       15.57882 13.29977 17.85788 12.09330 19.06435
## Feb 2026       15.58051 13.30133 17.85969 12.09480 19.06621
#5 measures of accuracy: the MAPE of 5.04% shows the forecasting error is low, and the residual analysis shows no significant autocorrelation, which means the model captures the patterns in the data well.
accuracy(arima_forecast)
##                       ME     RMSE       MAE        MPE     MAPE      MASE
## Training set 0.007369114 1.063385 0.7077504 -0.5744392 5.043681 0.2935773
##                  ACF1
## Training set 0.109244
#10. Accuracy Summary: the naive method uses the most recent observation as the predictor of all future values and suits highly random series with no other patterns. The simple moving average forecasts from the average of the most recent observations (the chosen order) and is useful for smoothing fluctuations in series with no clear pattern. ETS applies exponential smoothing with error, trend, and seasonality components and suits data with such patterns. Holt-Winters smooths level, trend, and seasonality and works well for data with strong seasonal patterns. ARIMA combines autoregressive, integrated, and moving-average components and is useful for series with trend and clear autocorrelation (with seasonal terms available as extensions).
accuracy_naive <- accuracy(naive_forecast)
accuracy_ma <- accuracy(forecast_ma)
accuracy_ets <- accuracy(ets_forecast)
accuracy_hw <- accuracy(hw_pred)
accuracy_arima <- accuracy(arima_forecast)
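A compact way to compare the methods is to stack the training-set measures into one matrix; the values below are copied from the accuracy() outputs above (the moving-average row is omitted because its accuracy is computed on the smoothed series, not the original data):

```r
# One row per model, one column per measure, copied from accuracy() output.
acc <- rbind(
  naive = c(ME = -0.0128, RMSE = 1.1843, MAE = 0.7771, MAPE = 5.4812, ACF1 = 0.0707),
  ets   = c(ME = -0.0126, RMSE = 1.1747, MAE = 0.7646, MAPE = 5.3930, ACF1 = 0.0708),
  hw    = c(ME =  0.2811, RMSE = 2.0842, MAE = 1.4427, MAPE = 9.1406, ACF1 = 0.3820),
  arima = c(ME =  0.0074, RMSE = 1.0634, MAE = 0.7078, MAPE = 5.0437, ACF1 = 0.1092)
)
rownames(acc)[which.min(acc[, "RMSE"])]   # "arima"
```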

#Best and Worst Forecast Method by Accuracy Measure (the moving average is excluded because its accuracy is computed on the smoothed series): for ME, the value closest to zero is ARIMA's and the worst is Holt-Winters'; for RMSE, MAE, and MAPE, the best is ARIMA and the worst is Holt-Winters; for ACF1, the best (closest to zero) is the naive method and the worst is Holt-Winters.
print(accuracy_naive)
##                       ME     RMSE       MAE        MPE    MAPE      MASE
## Training set -0.01277049 1.184281 0.7771311 -0.4826249 5.48124 0.3223567
##                   ACF1
## Training set 0.0706958
print(accuracy_ma)
##                        ME      RMSE       MAE       MPE     MAPE       MASE
## Training set -0.001621659 0.2179797 0.1522062 0.0572464 1.013278 0.07086712
##                   ACF1
## Training set 0.5553972
print(accuracy_ets)
##                       ME   RMSE       MAE       MPE     MAPE      MASE
## Training set -0.01259346 1.1747 0.7646209 -0.475062 5.393013 0.3171674
##                    ACF1
## Training set 0.07078222
print(accuracy_hw)
##                     ME     RMSE      MAE     MPE     MAPE      MASE     ACF1
## Training set 0.2810875 2.084194 1.442687 1.49868 9.140589 0.5984317 0.381969
print(accuracy_arima)
##                       ME     RMSE       MAE        MPE     MAPE      MASE
## Training set 0.007369114 1.063385 0.7077504 -0.5744392 5.043681 0.2935773
##                  ACF1
## Training set 0.109244
#11. Conclusion: the time series of car sales shows a sharp drop followed by a quick recovery, then steadies in recent years with a slight upward trend. Its seasonal pattern makes the data reasonably predictable. The forecasts suggest car sales will hold roughly steady to modestly higher over the next two years. The ARIMA model ranks highest in accuracy and captures the trends and patterns well. ETS is the second-best forecast model, with strong trend smoothing. The simple moving average smooths out the fluctuations, and the naive method also shows fair accuracy. Holt-Winters performs worst, possibly because it does not adapt well to this dataset's trend. Overall, the ARIMA model is the most reliable forecasting model for this car sales dataset.